Visualizing various types of data

Lecture 3

2024-05-20

Warm up

Questions?

From last time

Violin plots

library(tidyverse)      # provides ggplot2
library(palmerpenguins) # provides the penguins data

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin()

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_point()

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_jitter()

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter()

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  )

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  ) +
  scale_color_colorblind() # colorblind-friendly palette from the ggthemes package

Questions from previous material

  • Is there any code in the videos that is not in the readings? Yes and no. The videos introduce no substantial functionality that is not also in the readings, but the examples in the videos differ from those in the readings.

  • What are all of the geoms we need to know? You don’t need to “memorize” or even “know” all of the geoms available in the ggplot2 package. You can find a list of them on the ggplot2 cheat sheet or on the reference page.

  • Could you please clarify in which situations it would be appropriate to use each geom function? That is today’s topic! Think about it as “what plot should I make for which type of variable?”
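Besides the cheat sheet and the reference page, one rough way to browse the available geoms is from the R console. The sketch below lists every exported ggplot2 function whose name starts with geom_. The object name geoms is only illustrative.

```r
library(ggplot2)

# List every exported ggplot2 function whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")

# Peek at the first few names and count how many there are
head(geoms)
length(geoms)
```

The exact count depends on the installed ggplot2 version, so treat the list as a menu to look things up in, not something to memorize.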

Let’s return to AE-02

ae-02-bechdel-dataviz

Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae. If there are any uncommitted files, commit them so you can start with a clean slate.

Recap of AE

  • Construct plots with ggplot().
  • Layers of ggplots are separated by +s.
  • The formula is (almost) always as follows:
ggplot(DATA, aes(x = X-VAR, y = Y-VAR, ...)) +
  geom_XXX()
  • Aesthetic attributes of geometries (color, size, transparency, etc.) can be mapped to variables in the data or set by the user, e.g. color = binary vs. color = "pink".
  • Use facet_wrap() when faceting (creating small multiples) by one variable and facet_grid() when faceting by two variables.
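The last two points of the recap can be sketched with a small example. It uses the penguins data from later in this lecture instead of the AE data, which is an assumption made only to keep the example self-contained. The object names (p_mapped, p_set, p_wrap, p_grid) are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
  ) +
  geom_point()

# Setting: color is one fixed value, so it goes in the geom, outside aes()
p_set <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point(color = "pink")

# Faceting by one variable: one panel per island
p_wrap <- p_set + facet_wrap(~ island)

# Faceting by two variables: rows by sex, columns by island
p_grid <- p_set + facet_grid(sex ~ island)
```

Print any of the objects (e.g. p_wrap) to draw the plot.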

Visualizing various types of data

Identifying variable types

Identify the type of each of the following variables.

  • Favorite food
  • Number of classes you’re taking this semester
  • Zip code
  • Age

The way data is displayed matters

What do these three plots show?

Visualizing penguins

library(tidyverse)
library(palmerpenguins)
library(ggthemes)

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
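The abbreviations under the column names (<fct>, <dbl>, <int>) give the type of each variable. As a small sketch, the same information can be pulled out with base R. The name col_types is illustrative.

```r
library(palmerpenguins)

# Class of each column: factor = categorical, numeric/integer = numerical
col_types <- sapply(penguins, class)
col_types
```

Knowing which columns are categorical and which are numerical is the starting point for choosing a plot.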

Univariate analysis

Analyzing a single variable:

  • Numerical: histogram, box plot, density plot, etc.

  • Categorical: bar plot, pie chart, etc.
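The steps that follow cover numerical variables. For the categorical case, a minimal sketch of a bar plot of species might look like this. The object names p_bar and species_counts are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Bar plot: geom_bar() counts the rows in each category for us
p_bar <- ggplot(
  penguins,
  aes(x = species)
  ) +
  geom_bar() +
  labs(
    x = "Species",
    y = "Count"
  )

# The same counts as a table, to check against the bar heights
species_counts <- table(penguins$species)
species_counts
```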

Histogram - Step 1

ggplot(
  penguins
  )

Histogram - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Histogram - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram()

Histogram - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  )

Histogram - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  ) +
  labs(
    title = "Weights of penguins",
    x = "Weight (grams)",
    y = "Count"
  )
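With binwidth = 250, each bar covers a 250-gram range of weights. The choice of bin width changes the picture, which the following sketch tries to show. The values 100 and 1000 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)
library(palmerpenguins)

# Narrow bins: more detail, but a noisier shape
p_narrow <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 100)

# Wide bins: smoother, but features such as multiple peaks can disappear
p_wide <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 1000)

# Rough number of bins each choice needs to span the observed range
mass_range <- diff(range(penguins$body_mass_g, na.rm = TRUE))
n_bins_narrow <- ceiling(mass_range / 100)
n_bins_wide <- ceiling(mass_range / 1000)
```

Trying a few bin widths before settling on one is usually worthwhile.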

Boxplot - Step 1

ggplot(
  penguins
  )

Boxplot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Boxplot - Step 3

ggplot(
  penguins,
  aes(y = body_mass_g)
  ) +
  geom_boxplot()

Boxplot - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_boxplot()

Boxplot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_boxplot() +
  labs(
    x = "Weight (grams)",
    y = NULL
  )
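A box plot is a picture of a few summary statistics: the box runs from the first to the third quartile, with a line at the median. As a rough sketch, these numbers can be computed directly. The names mass_summary and mass_iqr are illustrative.

```r
library(palmerpenguins)

# Minimum, quartiles, and maximum of body mass;
# na.rm = TRUE drops the penguins with missing weights
mass_summary <- quantile(
  penguins$body_mass_g,
  probs = c(0, 0.25, 0.5, 0.75, 1),
  na.rm = TRUE
)
mass_summary

# Interquartile range: the width of the box
mass_iqr <- IQR(penguins$body_mass_g, na.rm = TRUE)
mass_iqr
```

Note that by default the whiskers in geom_boxplot() extend only to the most extreme points within 1.5 times the interquartile range of the box, so they do not always reach the minimum and maximum.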

Density plot - Step 1

ggplot(
  penguins
  )

Density plot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Density plot - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density()

Density plot - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1"
  )

Density plot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2
  )

Density plot - Step 6

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3"
  )

Density plot - Step 7

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3",
    alpha = 0.5
  )

Weights of penguins

TRUE / FALSE

  • The distribution of penguin weights in this sample is left skewed.
  • The distribution of penguin weights in this sample is unimodal.
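One rough numerical check on skewness, to go alongside looking at the plots, is to compare the mean and the median. This is only a heuristic sketch, and the object names are illustrative.

```r
library(palmerpenguins)

mass <- penguins$body_mass_g

# Center measured two ways, ignoring missing values
mass_mean <- mean(mass, na.rm = TRUE)
mass_median <- median(mass, na.rm = TRUE)

# Positive: the mean is pulled above the median (suggests right skew)
# Negative: the mean is pulled below the median (suggests left skew)
mass_mean - mass_median
```

The number of modes cannot be read off these summaries. For that, the histogram or density plot is still the tool to use.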

Bivariate analysis

Analyzing the relationship between two variables:

  • Numerical + numerical: scatterplot

  • Numerical + categorical: side-by-side box plots, violin plots, etc.

  • Categorical + categorical: stacked bar plots

  • Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
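The examples that follow cover the numerical + categorical case. As a sketch of the other two ideas in this list, the code below draws stacked bar plots for two categorical variables and uses facets for the second variable. Pairing island with species is just one possible choice, and the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Categorical + categorical: one bar per island, stacked by species
p_stacked <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

# Same data as proportions within each island
p_filled <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

# Facets as the second variable: one histogram panel per species
p_facets <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250) +
  facet_wrap(~ species, ncol = 1)
```

Stacking the panels in one column (ncol = 1) keeps the x-axes aligned, which makes the distributions easier to compare.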

Side-by-side box plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    y = species
    )
  ) +
  geom_boxplot()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species
    )
  ) +
  geom_density()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  )

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  ) +
  theme(
    legend.position = "bottom"
  )